Univariate Plots Section

## [1] 1599   12
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000
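
The code chunk that produced the output above is not shown. A minimal sketch of how such output is typically generated, assuming the calls were dim(), names(), str() and summary(): red_sample is a hypothetical stand-in that holds only the first five rows printed by str() above, so its dimensions and summary values differ from the full 1599-row data.

```r
# First five rows of the data, copied from the str() output above.
# `red_sample` is a hypothetical name; density is rounded as printed.
red_sample <- data.frame(
  fixed.acidity        = c(7.4, 7.8, 7.8, 11.2, 7.4),
  volatile.acidity     = c(0.7, 0.88, 0.76, 0.28, 0.7),
  citric.acid          = c(0, 0, 0.04, 0.56, 0),
  residual.sugar       = c(1.9, 2.6, 2.3, 1.9, 1.9),
  chlorides            = c(0.076, 0.098, 0.092, 0.075, 0.076),
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34),
  density              = c(0.998, 0.997, 0.997, 0.998, 0.998),
  pH                   = c(3.51, 3.2, 3.26, 3.16, 3.51),
  sulphates            = c(0.56, 0.68, 0.65, 0.58, 0.56),
  alcohol              = c(9.4, 9.8, 9.8, 9.8, 9.4),
  quality              = c(5L, 5L, 5L, 6L, 5L)
)

dim(red_sample)      # number of rows and columns
names(red_sample)    # variable names
str(red_sample)      # type and first values of each variable
summary(red_sample)  # min, quartiles, median, mean and max per variable
```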

The mean and median of red wine quality are 5.636 and 6, respectively. The distribution graph shows few red wines at either very low or very high quality, and the distribution appears to be roughly Gaussian.

To simplify our visualization, I created a categorical variable quality_level with levels [low, medium, high] to group the quality scores.
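
One way such a grouping could be built is with cut(). The cut points below (3 to 4 as low, 5 to 6 as medium, 7 to 8 as high) are an assumption, since the text does not state them; the quality values are the first ten printed by str() above.

```r
# First ten quality scores, as printed by str() above.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Assumed cut points (not stated in the text):
# (0, 4] -> low, (4, 6] -> medium, (6, 10] -> high.
quality_level <- cut(
  quality,
  breaks = c(0, 4, 6, 10),
  labels = c("low", "medium", "high")
)

table(quality_level)  # count of wines in each group
```

With the default right = TRUE, each interval includes its upper bound, so a score of 6 falls in medium and a score of 7 in high.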

residual.sugar and chlorides have more outliers than the other variables. citric.acid has 132 observations equal to 0, and its distribution has another peak at 0.49. density and pH are approximately normally distributed; both also fall within a very narrow range. residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, sulphates and alcohol are right-skewed.
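
The zero count and the outliers mentioned above can be checked numerically. A small sketch on the first ten values printed by str(), assuming "outlier" means a point beyond the boxplot whiskers (1.5 times the hinge spread), which the text does not confirm:

```r
# First ten values of each variable, as printed by str() above.
citric.acid    <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Number of observations exactly equal to zero.
n_zero <- sum(citric.acid == 0)

# Points beyond the boxplot whiskers (assumed outlier definition).
sugar_outliers <- boxplot.stats(residual.sugar)$out

n_zero
sugar_outliers
```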

Applying a log transformation to the right-skewed variables makes their distributions closer to normal.
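
The effect of the transformation can be illustrated with a moment-based skewness measure. This is only a sketch: skewness() is a helper defined here, the base of the logarithm (10) is an assumption, and the input is the first ten residual.sugar values from the str() output rather than the full variable.

```r
# Moment-based sample skewness: third central moment over
# the 1.5 power of the second central moment.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# First ten residual.sugar values, as printed by str() above.
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

skew_raw <- skewness(residual.sugar)
skew_log <- skewness(log10(residual.sugar))  # base 10 is an assumption

c(raw = skew_raw, log10 = skew_log)
```

A positive value indicates a right skew; the log pulls in the long right tail, so the second value is smaller than the first. Note that a variable containing zeros, such as citric.acid, cannot be log transformed directly.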

Univariate Analysis

What is the structure of your dataset?

There are 1599 observations in the dataset with 12 original variables (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality), plus the derived variable quality_level. All of the original variables are numeric; quality is stored as an integer. Below are some observations:

  • Most red wines are medium quality.
  • Densities of red wines are mostly in the range [0.99,1].
  • Red wines have low pH values; the median is 3.310.
  • The median percentage of alcohol in red wines is 10.20.
  • Around 75% of red wines have residual sugar of 2.6 or less after fermentation stops.

What are the main features of interest in your dataset?

The main feature of interest in this dataset is the quality of red wine. I will further examine the relationship of each variable with quality and select suitable variables to build a predictive model.

What other features in the dataset do you think will help support your investigation into your feature of interest?

Some variables might provide similar information, for example the sulfur-related variables (sulphates, free.sulfur.dioxide and total.sulfur.dioxide) and the three kinds of acid (fixed.acidity, volatile.acidity and citric.acid). I assume that there are three general groups of features of interest: acid (pH), alcohol and sulphates.

Did you create any new variables from existing variables in the dataset?

I created a new variable, quality_level, to simplify visualization.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

There are six variables (residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, sulphates and alcohol) that are right-skewed. I performed a log transformation on them to make them more normally distributed. Linear regression does not strictly require normally distributed variables (its normality assumption concerns the residuals), but reducing the skew limits the influence of extreme values, so using the transformed variables should make the linear regression model built later more robust.

Bivariate Plots Section

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000     -0.256130895  0.67170343
## volatile.acidity       -0.25613089      1.000000000 -0.55249568
## citric.acid             0.67170343     -0.552495685  1.00000000
## residual.sugar          0.11477672      0.001917882  0.14357716
## chlorides               0.09370519      0.061297772  0.20382291
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302
## density                 0.66804729      0.022026232  0.36494718
## pH                     -0.68297819      0.234937294 -0.54190414
## sulphates               0.18300566     -0.260986685  0.31277004
## alcohol                -0.06166827     -0.202288027  0.10990325
## quality                 0.12405165     -0.390557780  0.22637251
##                      residual.sugar    chlorides free.sulfur.dioxide
## fixed.acidity           0.114776724  0.093705186        -0.153794193
## volatile.acidity        0.001917882  0.061297772        -0.010503827
## citric.acid             0.143577162  0.203822914        -0.060978129
## residual.sugar          1.000000000  0.055609535         0.187048995
## chlorides               0.055609535  1.000000000         0.005562147
## free.sulfur.dioxide     0.187048995  0.005562147         1.000000000
## total.sulfur.dioxide    0.203027882  0.047400468         0.667666450
## density                 0.355283371  0.200632327        -0.021945831
## pH                     -0.085652422 -0.265026131         0.070377499
## sulphates               0.005527121  0.371260481         0.051657572
## alcohol                 0.042075437 -0.221140545        -0.069408354
## quality                 0.013731637 -0.128906560        -0.050656057
##                      total.sulfur.dioxide     density          pH
## fixed.acidity                 -0.11318144  0.66804729 -0.68297819
## volatile.acidity               0.07647000  0.02202623  0.23493729
## citric.acid                    0.03553302  0.36494718 -0.54190414
## residual.sugar                 0.20302788  0.35528337 -0.08565242
## chlorides                      0.04740047  0.20063233 -0.26502613
## free.sulfur.dioxide            0.66766645 -0.02194583  0.07037750
## total.sulfur.dioxide           1.00000000  0.07126948 -0.06649456
## density                        0.07126948  1.00000000 -0.34169933
## pH                            -0.06649456 -0.34169933  1.00000000
## sulphates                      0.04294684  0.14850641 -0.19664760
## alcohol                       -0.20565394 -0.49617977  0.20563251
## quality                       -0.18510029 -0.17491923 -0.05773139
##                         sulphates     alcohol     quality
## fixed.acidity         0.183005664 -0.06166827  0.12405165
## volatile.acidity     -0.260986685 -0.20228803 -0.39055778
## citric.acid           0.312770044  0.10990325  0.22637251
## residual.sugar        0.005527121  0.04207544  0.01373164
## chlorides             0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide  0.042946836 -0.20565394 -0.18510029
## density               0.148506412 -0.49617977 -0.17491923
## pH                   -0.196647602  0.20563251 -0.05773139
## sulphates             1.000000000  0.09359475  0.25139708
## alcohol               0.093594750  1.00000000  0.47616632
## quality               0.251397079  0.47616632  1.00000000

From the correlation table, we can see that most variables have very small correlation coefficients with quality. alcohol has the highest correlation with quality (0.476), followed by volatile.acidity (-0.391) and sulphates (0.251). I will further analyze these three variables against red wine quality.
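
The ranking can be read off the table mechanically. The sketch below copies the quality column of the correlation matrix above into a named vector (quality_cor is a name introduced here) and orders it by absolute value, so that negative correlations are ranked by strength rather than sign.

```r
# Correlation of each variable with quality, copied from the table above.
# On the full data this row would come from something like
# cor(red)["quality", ], assuming `red` holds only numeric columns.
quality_cor <- c(
  fixed.acidity        =  0.12405165,
  volatile.acidity     = -0.39055778,
  citric.acid          =  0.22637251,
  residual.sugar       =  0.01373164,
  chlorides            = -0.12890656,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.18510029,
  density              = -0.17491923,
  pH                   = -0.05773139,
  sulphates            =  0.25139708,
  alcohol              =  0.47616632
)

# Rank by strength of association, ignoring sign.
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
top3   <- head(ranked, 3)

round(top3, 3)
```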

First we take a look at the relationship between [alcohol, volatile.acidity, sulphates] and quality. Since quality is a discrete variable, boxplots are the most straightforward view. Note that alcohol and sulphates are log transformed in the following analysis, although the numeric summaries printed below are on the original scale.

## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.580  11.000 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

The medians and quartiles of each boxplot show that higher-quality wines tend to have a higher percentage of alcohol: the median rises from 9.7 at quality 5 to 12.15 at quality 8, while quality scores 3 to 5 have similar medians.
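
The grouped summaries above have the layout that by() prints. A minimal sketch of that kind of call, run on the first ten alcohol and quality values from the str() output (red_sample is a hypothetical stand-in, so only scores 5, 6 and 7 appear and the numbers differ from the full-data output):

```r
# First ten rows of the two variables, as printed by str() above.
red_sample <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Full summary of alcohol within each quality score.
by(red_sample$alcohol, red_sample$quality, summary)

# Just the medians, as a named array indexed by quality score.
alcohol_medians <- tapply(red_sample$alcohol, red_sample$quality, median)
alcohol_medians
```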

## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500

volatile.acidity has a negative correlation with quality: the higher the volatile.acidity, the lower the quality. The median falls from 0.845 at quality 3 to 0.37 at quality 7 and 8.

## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5125  0.5450  0.5700  0.6150  0.8600 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4900  0.5600  0.5964  0.6000  2.0000 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.370   0.530   0.580   0.621   0.660   1.980 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5800  0.6400  0.6753  0.7500  1.9500 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7413  0.8300  1.3600 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6300  0.6900  0.7400  0.7678  0.8200  1.1000

sulphates has a positive correlation with quality: the median rises from 0.545 at quality 3 to 0.74 at quality 7 and 8.

In addition, the correlation matrix shows that fixed.acidity is strongly correlated with citric.acid (0.67), pH (-0.68) and density (0.67). free.sulfur.dioxide and total.sulfur.dioxide are also highly correlated (0.67), as I suggested earlier.

Notice that density is described as “the density of wine is close to that of water depending on the percent alcohol and sugar content”, so it should be strongly correlated with residual.sugar and alcohol. It does have a moderate correlation with alcohol (-0.50) and with residual.sugar (0.36), but its strongest correlation is with fixed.acidity (0.67).

I also assumed earlier that the sulfur-related variables would be strongly correlated with each other. In fact, sulphates has very little correlation with free.sulfur.dioxide (0.05) and total.sulfur.dioxide (0.04).

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Many features have very low correlation with quality, especially residual.sugar, free.sulfur.dioxide and pH, whose coefficients are near zero.

alcohol has the strongest correlation with quality. The other two features with absolute correlation coefficients larger than 0.25 are volatile.acidity and sulphates.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

sulphates has low correlation with free.sulfur.dioxide and total.sulfur.dioxide.

density is more strongly correlated with fixed.acidity than with residual.sugar or alcohol.

free.sulfur.dioxide has a strong correlation with total.sulfur.dioxide.

What was the strongest relationship you found?

The quality of red wine is positively correlated with the percentage of alcohol and negatively correlated with volatile acidity.

Multivariate Plots Section

To further examine the two variables most strongly correlated with quality, I created the graph below:

High quality red wines tend to have higher alcohol values and lower volatile.acidity. The same graphs for the other two pairs of variables:

Given that even the largest correlation coefficient is still quite low, it is no surprise that the R-squared of our linear regression model is low. Even with all variables added, the R-squared is only 0.368.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Better quality red wines have higher alcohol and sulphates values and lower volatile acidity.
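
This pattern can also be checked numerically by taking group medians per quality level. The sketch below uses the first ten rows printed by str() and assumed cut points for quality_level (3 to 4 low, 5 to 6 medium, 7 to 8 high); red_sample and level_medians are names introduced here, and ten rows are far too few to draw conclusions from.

```r
# First ten rows of the relevant variables, as printed by str() above.
red_sample <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Assumed grouping of the scores into three levels.
red_sample$quality_level <- cut(
  red_sample$quality,
  breaks = c(0, 4, 6, 10),
  labels = c("low", "medium", "high")
)

# Median of both variables within each quality level;
# levels with no observations are dropped from the result.
level_medians <- aggregate(
  cbind(alcohol, volatile.acidity) ~ quality_level,
  data = red_sample,
  FUN  = median
)

level_medians
```

Even in this small sample the high group has the higher median alcohol and the lower median volatile.acidity, which matches the direction described above.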

Overall, none of the features have strong correlation with quality.

Were there any interesting or surprising interactions between features?

quality is positively correlated with sulphates but negatively correlated with total.sulfur.dioxide. SO2 acts as an antimicrobial and antioxidant, but at high levels it affects the smell and taste of red wine, so we could intuitively conclude that the higher the SO2, the lower the quality. sulphates, however, is different from the sulfur dioxide level: it is a wine additive.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I created a linear regression model to predict red wine quality. The model has a low R^2 in general. The three features most correlated with quality explain 34.6% of the total variance; with all features added, the model explains 36.8% of the total variance in red wine quality. Since the correlation coefficients are all quite low, the data do not fit the linear model's assumption (that quality is linearly related to the features) very well. It is not advisable to rely on a linear regression model for this dataset.
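
The model code is not shown, so the following is only a sketch of how two nested lm() fits and their R^2 values could be compared. It uses the first ten rows printed by str() (red_sample is a hypothetical stand-in), untransformed predictors, and pH as a stand-in for "the remaining features"; all of these are assumptions, and the resulting R^2 values will not match the 34.6% and 36.8% reported for the full data.

```r
# First ten rows of the relevant variables, as printed by str() above.
red_sample <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Model with the three most correlated features.
fit_top3 <- lm(quality ~ alcohol + volatile.acidity + sulphates,
               data = red_sample)

# Larger nested model with one extra predictor.
fit_more <- lm(quality ~ alcohol + volatile.acidity + sulphates + pH,
               data = red_sample)

# Share of variance in quality explained by each model.
r2_top3 <- summary(fit_top3)$r.squared
r2_more <- summary(fit_more)$r.squared

c(top3 = r2_top3, more = r2_more)
```

Adding a predictor to a least-squares model with an intercept can never lower R^2, so a small increase such as 34.6% to 36.8% says little on its own; summary() also reports adj.r.squared, which penalizes extra predictors.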

Final Plots and Summary

Plot One

Description One

The distribution of red wine quality is nearly normal. Most red wines are of medium quality, probably because good-quality red wines are hard to produce and most customers can only afford medium-quality red wines.

Plot Two

Description Two

Red wine quality is positively correlated with alcohol percentage and negatively correlated with volatile acidity. The highest quality red wines (quality 8) have a median alcohol content of 12.15% and a median volatile acidity of 0.37.

Plot Three

Description Three

Higher quality red wine has a higher level of alcohol and a lower level of volatile acidity. As the graph shows, the linear relationship is not strong.

Reflection

The red wine data set has 1599 observations, which were collected in 2009. The red wines are variants of the Portuguese “Vinho Verde” wine. The quality of each wine is scored by experts on a scale from 0 to 10.

I started with univariate analysis. Almost half of the variables in the dataset are right-skewed, and I performed a log transformation on these variables. Since there are some similarities among the variables, I supposed that the quality of red wine is mainly affected by three groups of indicators (sulfur dioxide, alcohol and acidity). I then continued with the bivariate and multivariate analysis. It turned out that my supposition was partially correct: the three variables most correlated with quality are alcohol, volatile acidity and sulphates. There are some interesting correlations among the features. Nevertheless, all features have a low correlation with red wine quality.

Given the low correlations, it is no surprise that the linear regression model does not perform well. All variables in the dataset together explain only 36.8% of the total variance in red wine quality. The limitation of our model is obvious, and it is best not to use a linear model in this case.

I think the issue mentioned above might come from the original dataset.

  1. The dataset evaluates only one source of red wine. There are many other kinds of red wine, and including only one kind makes our dataset biased when we want to predict the quality of other kinds.

  2. There are not enough observations and variables. We can see from the correlation table that several variables are correlated with each other, so it would be better to have more independent indicators for the red wine. The numbers of both low and high quality red wines are quite small, and an analysis based on so few observations might not be accurate.